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Chromosomal DNA is characterized by variation between individ- 
uals at the level of entire chromosomes (e.g., aneuploidy in which the 
chromosome copy number is altered), segmental changes (including 
insertions, deletions, inversions, and translocations), and changes to 
small genomic regions (including single nucleotide polymorphisms). A 
variety of alterations that occur in chromosomal DNA, many of which 
can be detected using high density single nucleotide polymorphism 
(SNP) microarrays, are linked to normal variation as well as disease 
and are therefore of particular interest. These include changes in copy 
number (deletions and duplications) and genotype (e.g., the occur- 
rence of regions of homozygosity). Hidden Markov models (HMM) 
are particularly useful for detecting such alterations, modeling the 
spatial dependence between neighboring SNPs. Here, we improve pre- 
vious approaches that utilize HMM frameworks for inference in high 
throughput SNP arrays by integrating copy number, genotype calls, 
and the corresponding measures of uncertainty when available. Us- 
ing simulated and experimental data, we, in particular, demonstrate 
how confidence scores control smoothing in a probabilistic framework. 
Software for fitting HMMs to SNP array data is available in the R 
package VanillalCE. 
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1. Introduction. Chromosomal DNA is characterized by variation be- 
tween individuals at the level of entire chromosomes (e.g., aneuploidy in 
which the chromosome copy number is altered), segmental changes (in- 
cluding insertions, deletions, inversions, and translocations), and changes 
to small genomic regions (including single nucleotide polymorphisms). A 
variety of alterations that occur in chromosomal DNA, many of which can 
be detected using high density single nucleotide polymorphism (SNP) mi- 
croarrays, are linked to normal variation as well as disease and are there- 
fore of particular interest [Shaw-Smith et al. (2004), Aguirre et al. (2004), 
Aggarwal et al. (2005), Dutt and Beroukhim (2007), Sebat et al. (2007), 
Szatmari et al. (2007)]. These include changes in copy number (deletions 
and duplications) and genotype (e.g., the occurrence of regions of homozy- 
gosity). 

Copy number variations can arise through somatic and germline events. 
While naturally occurring and often (but not always) benign, germline copy 
number variations are more abundant than previously thought [Freeman et al 
(2006), Redon et al. (2006), Eichler et al. (2007)]. On the other hand, so- 
matic copy number changes, such as gene amplifications and deletions, fre- 
quently contribute to tumorigenesis (or might be the consequence of it). 
Regions of homozygosity (i.e., long stretches of homozygous SNPs) can also 
occur through somatic and germline events. A hemizygous deletion of one 
chromosomal allele results in only one DNA copy, and therefore, SNPs in 
that region will appear as homozygous (given current genotyping technolo- 
gies that generate only biallelic calls) . The definition of loss of heterozygosity 
(LOH) refers to such a somatic event: for example, comparing a tumor and 
normal sample from the same person, any heterozygous SNPs in the nor- 
mal sample appear as homozygous SNPs in the tumor sample, in any region 
where an allele was lost. As already noted, regions of homozygosity can also 
occur through germline events. While chromosomal DNA is typically inher- 
ited from both parents, under some circumstances an individual inherits two 
copies of a chromosome from one parent. The inheritance of both homologues 
of a pair of chromosomes from only one parent can be due to autozygosity 
(homozygosity in which alleles are identical by descent) or to uniparental di- 
somy [UPD, Robinson (2000), Engel (2006)]. Autozygosity and UPD do not 
involve an aneuploidy (change in chromosomal copy number), and the re- 
gion of homozygosity may extend over an entire chromosome or segmentally 
across a subregion of a chromosome. The condition is termed uniparental 
isodisomy (iUPD) if the two copies inherited from one parent are identical, 
and results in stretches of homozygous SNPs. (If the two inherited copies 
are different homologues, the result is uniparental heterodisomy, hUPD, but 
does not result in stretches of homozygous SNPs.) In some cases, UPD is 
thought to be benign, but can also be associated with disease [Prader-Willi 
syndrome, Angelman syndrome, Beckwith- Wiedemann syndrome, see, e.g., 
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Altug-Teber et al. (2005)]. UPD can disrupt genomic imprinting, such that 
imprinted genes (expressed preferentially from the paternal or maternal al- 
leles) fail to be expressed. UPD can also cause homozygosity for autosomal 
recessive traits such as cystic fibrosis [Zlotogora (2004)]. 

A variety of technologies have been applied for the assessment of chro- 
mosomal abnormalities including conventional karyotyping (e.g., Giemsa 
staining of metaphase chromosomes) and fluorescence in situ hybridiza- 
tion (FISH). While the former only allows for the genome- wide detection 
of major chromosomal amplifications and deletions, the latter allows for the 
verification of suspected microdeletions as well as translocations and some 
duplications. Array comparative genome hybridization (aCGH) permits a 
genome-wide measurement of copy number variation using bacterial artifi- 
cial chromosome (BAC) clones deposited on a microarray. This is a high 
throughput technique, but the resolution is limited to tens or hundreds of 
thousands of base pairs and no genotype data are obtained. 

SNP microarray technology permits the genome-wide search for chro- 
mosomal abnormalities, providing genotype and copy number estimates for 
hundreds of thousands of SNPs in genomic DNA isolated from a biological 
sample. Statistical tools for the analysis of such SNP chip data are typ- 
ically employed to assess where the chromosomal changes have occurred, 
and whether or not these changes are associated with disease. Regions of in- 
terest are typically aneuploidies, that is, regions where copy number changes 
(deletions and amplifications) have occurred, or regions with unusually long 
stretches of homozygous genotypes (either naturally occurring, e.g., through 
evolutionary pressure on a DNA segment, or through loss of heterozygosity, 
LOH). 

For the analysis of SNP chip data in general, three different tiers of es- 
timation problems arise. (1) By SNP: how can we use the low-level data 
(such as the fluorescence measurements in Affymetrix SNP chips) to op- 
timally estimate the genotype and DNA copy number for each SNP in 
the array? (2) By sample: how can we borrow strength between neigh- 
boring SNPs, and infer regions of LOH and copy number changes in the 
genome of the subject studied? (3) Between samples: how can we compare 
the genotype of many subjects, infer common regions of abnormality, and, 
for example, assess differences between affected subjects and normal con- 
trols? This manuscript revolves around methods for tier 2, the assessment 
of chromosomal abnormalities in one particular sample. However, informa- 
tion derived from tier 1, in particular, uncertainty estimates of copy number 
and genotype estimates, can be critically important and will be incorpo- 
rated in the analysis. In particular, for the Affymetrix platform, originally 
described as a high-throughput assay for calling genotypes at thousands 
of SNPs [Kennedy et al. (2003)], there have been several algorithms pro- 
posed for the appropriate adjustment and pre-processing of probe-level data, 
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and the estimation of SNP-level summaries of probe-level data for geno- 
type [DM, Di et al. (2005), RLMM, Rabbee and Speed (2006), BRLMM, 
Affymetrix (2006), CRLMM, Carvalho, Speed and Irizarry (2007), SNiPer- 
HD, Hua et al. (2007)] and copy number (CNAG, Nannya et al. (2005), 
CARAT, Huang et al. (2006), PLASQ, Laframboise, Harrington and Weir 
(2007), CN-RLMM, Wang et al. (2007)]. Notably, Laframboise, Harrington 
and Weir (2007) and Wang et al. (2007) provide allele-specific estimates of 
copy number. 

We caution that, as with gene expression technologies, pre-processing of 
probe-level data is an important consideration. For instance, several recent 
papers have described fragment-length and sequence effects that may be 
introduced by the polymerase chain reaction (PCR) used to amplify the 
DNA [Nannya et al. (2005), Carvalho, Speed and Irizarry (2007)]. We as- 
sume that SNP-level summaries for each interrogated SNP have been ad- 
justed for probe-specific biases to the extent possible. Statistical models such 
as CRLMM that use Hapmap data for training have been shown to provide 
better genotype calls when the centers of the bivariate scatterplots for the 
A and B allele intensities are less well defined [Carvalho, Speed and Irizarry 
(2007)]. Genotype calls for most genotyping algorithms are concordant for 
over 99.9% of the measured SNPs in the Affymetrix 100k and 500k chips 
when performance is compared on apparently normal individuals represented 
in the HapMap study. 

Statistical methods that provide an indication of the uncertainty of the 
genotype call [e.g., based on the single to noise ratio (SNR) and log likelihood 
ratio (LLR) defined by CRLMM] can be particularly useful for statistical 
algorithms devised to infer chromosomal abnormalities. Specifically, statis- 
tical models that borrow strength from neighboring SNPs to infer loss or 
retention of heterozygosity should incorporate the uncertainty of the geno- 
type call estimate, giving less weight to genotype calls that are measured 
with high uncertainty and more weight to well-estimated genotypes. To our 
knowledge, this manuscript is the first one to address this issue. Figure 1 
illustrates why the uncertainty in genotype calls can differ substantially. Sim- 
ilarly, probe-specific biases for copy number estimates have been described 
before, see, for example, Wang et al. (2007). 

Before high-throughput SNP chips were widely available, array compara- 
tive genomic hybridization (aCGH) was the most commonly used method to 
assess DNA copy numbers, and assess regions in the genome where deletions 
or amplifications occurred in a particular sample. Thus, many statisticians 
have proposed approaches for aCGH based copy number estimation, and 
some of these proposed methods are also relevant for SNP chip based copy 
number analysis. Approaches for aCGH data include hidden Markov models 
[Fridlyand et al. (2004), Guha, Li and Neuberg (2006)], segmentation algo- 
rithms [Olshen et al. (2004), Picard et al. (2005), Venkatraman and Olshen 
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(2007)], wavelets [Hsu et al. (2005)], smoothing [Hupe et al. (2004), Eilers 
and de Menezes (2005)] regression [Houseman, Coull and Betensky (2006), 
Huang et al. (2005)], clustering [Wang et al. (2005)], and resampling 
[Lai and Zhao (2005)]. The manuscript by Lai et al. (2005) and 
Willenbrock and Fridlyand (2005) contain reviews and comparisons of the 
performances of several of these proposed methods. In addition, many useful 
extensions or alternative approaches for the above listed methods are being 
proposed. Some recent publications have confirmed that naturally occurring 
DNA copy number variations are more abundant than previously thought 
[Freeman et al. (2006), Redon et al. (2006)], which can produce outliers in 
the aCGH data. Integrating these known copy number variations as permis- 
sible outliers into a hidden Markov model to assess where abnormal copy 
number alterations have occurred has been proposed by Shah et al. (2006). 

For the statistical analysis, SNP chip data differ from array CGH data in 
two important ways: (a) SNP chips also provide information for the geno- 
type, that is, give homozygous/heterozygous SNP calls, and (b) provide a 
much denser coverage, currently generating genotype information and copy 
number estimates at locations in excess of 500,000 SNPs. The correlation 
structure between those estimates has to be an essential part of any statis- 
tical modeling approach. The most promising methods currently available 
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Fig. 1. HapMap genotype calls (the gold standard) for a bad SNP (left) and a good SNP 
(right) for 269 samples measured on Affymetrix 100k SNP chips. The HapMap consensus 
genotype call (taken to be the gold standard) is indicated by color: AA (medium grey), 
AB (white), and BB (dark grey). The separation between genotype clusters is SNP-spe- 
cific. This figure motivates an approach that incorporates uncertainty estimates to control 
smoothing. 
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are based on hidden Markov models. In particular, to infer LOH regions 
and to estimate copy numbers changes, the dChip software and methods are 
among the most widely used in the scientific literature for the analysis of 
SNP chip data. The dChip methods are based on separate hidden Markov 
Models for genotype analysis [Lin et al. (2004), Beroukhim et al. (2006)] and 
copy number [Zhao et al. (2004)] . The original dChipSNP HMM [Lin et al. 
(2004)] was devised to assess loss of heterozygosity regions (a region with 
an allelic loss, where heterozygote SNPs in a normal sample appear as ho- 
mozygote SNPs in a tumor sample). This required paired tumor and normal 
samples from the same subject. As these are often not available, an exten- 
sion of this model was proposed by Beroukhim et al. (2006) to allow for 
LOH assessment without paired samples (e.g., tumor only). Note that such 
an approach using unpaired data would also be required in settings that 
do not involve abnormal tissue, for example, when subjects with mental re- 
tardation and apparently normal controls are investigated to assess possible 
differences in the karyotypes. The dChipSNP hidden Markov Model for copy 
number assessment [Zhao et al. (2004)] is somewhat similar in nature to the 
one used for LOH analysis; see Zhao et al. (2004). 

Copy number estimates and genotype calls, however, can provide com- 
plementary information. For example, without copy number information, 
genotype calls alone would not allow for a distinction of LOH due to dele- 
tion or uniparental isodisomy (iUPD), which occurs when a subject inherits 
the same copy of a chromosome (or parts thereof) twice from one parent. 
While this has been recognized and concurrent analyses have been reported 
[see, e.g., Zhou et al. (2004, 2005) and Ninomiya et al. (2006)], these analy- 
ses were carried out separately for genotype calls and copy number estimates, 
and the results visually compared. Not until very recently has the need for 
an integrated analysis of copy number and genotype been addressed for the 
first time. Colella et al. (2007) propose a Bayesian hidden Markov model 
approach (QuantiSNP), using both genotype and copy number estimates 
to infer underlying states (deletions, amplifications, copy neutral regions of 
homozygosity, etc.) of interest. We caution though that data derived from 
cancer samples might create substantial problems for HMM based methods 
like QuantiSNP and our approach: DNA copy numbers larger than three 
are quite possible in such settings, and thus, the number of possible states 
expands dramatically. Further, noninteger copy numbers do make sense in 
tumors due to the mix of normal and abnormal cells in the sample [i.e., mo- 
saicism; see Ting et al. (2006) for an example]. In these settings, copy num- 
ber based segmentation approaches might be more promising [Olshen et al. 
(2004), Picard et al. (2005), Venkatraman and Olshen (2007)], in particular, 
as the definition of a "genotype" is unclear. In this manuscript, we propose 
a hidden Markov model for the integrated analysis of copy number and 
genotype estimates, most applicable for abnormalities as a consequence of 
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germline events. We also develop the methodology to integrate genotype and 
copy number estimate uncertainty measures, and illustrate how integrating 
such confidence scores of the SNP-level summaries in the HMM can improve 
inference for the underlying hidden states using simulated and experimental 
data. These ideas are implemented in the R package VanillalCE. 

2. Methods. In this section we describe three HMMs, dependent on 
whether genotype estimates (abbreviated GT), copy number estimates (ab- 
breviated CN), or both GT and CN are available as defined by three classes 
of objects for SNP array data [Scharpf et al. (2007)]. 

2.1. Genotype calls. Most algorithms that provide SNP-level summaries 
of genotype assume a copy number of two, and report the genotype estimates 
as such. We therefore assume throughout this paper that the GT are of the 
generic form AA or BB and AB corresponding to HOM and HET, respectively. 
The vanilla HMM with hidden states retention ($) and loss (!) of heterozy- 
gosity require specification of the initial state probability distribution, the 
emission probabilities (denoted by (3 below), and the transition probabilities 
(denoted by r below) between the true states. Commonly employed in the 
literature for the transition probability is the "instability-selection" model 
for LOH analysis [Newton et al. (1998), Beroukhim et al. (2006)] that de- 
scribes the dependencies between the underlying states of adjacent SNPs as 
a function of distance. For any two adjacent SNPs, 6 is defined as the proba- 
bility that the state of the first marker is not informative (denoted by 1°) for 
the state of the second marker. As the distance between SNPs affects this 
probability, it is modeled as 6(d) = 1 — e~ 2d , where d is a genetic or physical 
distance [e.g., 100 Mb units; see Beroukhim et al. (2006)] between adjacent 
SNPs. We assume that with probability 1 — 6(d), SNP^ is informative (de- 
noted by /) for SNP(j +1 ) and that no change in state occurs between the 
adjacent SNPs. For example, this leads to 

T ]/l (d)=P(U +1 \l l ,d) 

= P(U +1 ,l\U,d) + P(U +1 ,l c \U,d) 

(2.1) = P(\ i+1 \I,k,d)xP(I\k,d) 

+ P(U +1 \I c ,U,d)xP(I c \U,d) 
= 1 -6(d) +P(!) x 6(d), 

as the probability that the state of SNP( i+1 j is !, given that the state of 
SNP(j) with distance d was !. Also, 

(2.2) r $/l (d) = P($ l+1 \U,d) = 1 - P(! i+ i|!i,d) = 6(d) x P($). 
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-P($) and P(!) refer to the initial probabilities for $ and !, respectively. 
These initial probabilities can be set as fixed constants using knowledge 
from previous experiments, or alternatively, learned via the EM algorithm 
[Dempster, Laird and Rubin (1977)]. 

Emission probabilities for states ! and $ are estimated as 

(2.3) #(GT) ~ Binomial (p = 0.99) and /3$(GT) ~ Binomial(p = 0.7), 

where p is the probability of a homozygous genotype call. We use the above 
probabilities as defaults to reflect values typically seen in experimental data. 
In a region of retention $, about 70% of SNPs on average are homozygous, 
while in a region of loss ! all SNPs are homozygous, but genotyping errors do 
occur. Alternatively, as these probabilities are affected by the quality of the 
assay, they can also be learned via the EM algorithm. In practice, we find 
that our approaches are rather insensitive to changes in these parameters. 
It is certainly also possible to use SNP-specific homozygosity rates here if 
they are known from a reference population. Efficient computation of the 
probability of the observed sequence given the model is carried out using the 
forward algorithm as described in Rabiner (1989). The most probable state 
sequence given the model is calculated via the Viterbi algorithm [Viterbi 
(1967), Rabiner (1989)]. 

Integrating confidence estimates (ICE). When confidence estimates are 
available, the observed data at a SNP is the genotype call (GT) and the 
uncertainty measure Sg^,. The joint distribution of GT and Sg^, depends on 
the underlying state. For example, if the state for a particular SNP is !, the 
emission probability is 

(2.4) /J,{GT, S^} = /{GT | !} x f{S~ | GT, !}. 

Note that the first of the two terms on the right-hand side of equation (2.4) is 
simply the emission probability when estimates of uncertainty are not avail- 
able. The second term can be understood as a weight for the former term that 
depends on the confidence with which the call is made. The second term can 
be approximated using a density estimate of the Sg^, where the gold stan- 
dard is available. For example, using CRLMM on the 269 HapMap samples, 
the distributions of the respective uncertainty measures for all four possible 
combinations of called and true genotypes measured on the Affymetrix 100k 
SNP chips are known. We use kernel based density estimates to obtain the 
distributions of the confidence scores, given the true and called genotype 
(separately for the Xba and Hind 50k chips): 

/{ Sg^ | HOM,HOM},/{S ffl | HOM, HET}, 

( 2 -5) 

/{S— | HET, HOM}, /{S— | HET, HET}. 
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The first term in (2.5), for example, denotes the density of the scores when 
the genotype is correctly called homozygous (HOM) and the true genotype 
is homozygous (HOM). If the underlying state is !, then the true genotype 
is always HOM and we assume that 

/{S M |HOM,!} = /{S M |HOM,HOM} and 

( 2 - 6 ) - 

.MS,,,., | HET, !} = /{S — | HET,HOM}. 

If the underlying state is $, then the true genotype can be HET or HOM. 
We therefore estimate the emission probabilities for state $ as 

/%{GT,S^} 

= /{GT|$}/{S^|GT,$} 

= /{GT | $}(/{S^,HOM | GT, $} + /{S^,HET | GT,$}) 
(2.7) = /{GT | $}(/{S — | HOM, GT, $}/{HOM | GT, $} 

+ f{S~ | HET, GT, $}/{HET | GT, $}) 
= /{GT | $}(/{S^ | HOM, GT}/{HOM | GT, $} 

+ /{S- | HET, GT}/{HET | GT, $}). 

The unknown terms in equation (2.7), /{HOM | GT, $} and /{HET | 
GT, $}, are also estimated from the HapMap samples. 



2.2. Copy number. The hidden states for autosomal copy numbers are 
hemizygous deletion (\), two copies (— >), and more than two copies (/"). 
A typical, and from practical experience, quite reasonable assumption when 
only copy number is considered (applied to aCGH and SNP chip data) is that 
the logarithm of the copy number estimate, after normalization, is roughly 
normally distributed around the true log copy number [see, e.g., Zhao et al. 
(2004)], although slightly heavier tails may also be observed in practice. 
More important however is the fact that the variability is not necessarily 
constant across SNPs, which we will address in the ICE HMM. If the variance 
was assumed to be constant (as done in the vanilla HMM), this parameter 
can be learned via the EM algorithm [Dempster, Laird and Rubin (1977)], 
or estimated in a robust manner, for example, using quantiles from the 
observed data. In the examples presented here, we obtained a robust estimate 
for the standard deviation of copy number estimates using the 16th and 
84th percentiles of the log 2 transformed CN (corresponding to plus minus 
one standard deviation from the median). For a state S, the mean /ig and 
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variance a"g of the Gaussians used to describe the emission probabilities 
can be fixed at starting values, or updated by EM. In the vanilla HMM we 
assume a constant a 2 and estimate the emission probabilities for state \, 
for instance (on the log 2 scale, not divided by 2), as 

(2.8) /?\(CN) = /(CN|\) ~ N(ti s = 0, var = a 2 ). 

The transition probability for the copy number HMM is the same as the one 
described above. 

Integrating confidence estimates (ICE) . The emission probabilities for the 
HMM retains the same locationjwameters for the Gaussian, but with SNP- 
specific standard errors for the CN. For a given SNP, the emission probability 
for copy number two (— >), for example, is 

(2.9) /MCNlS^j-A^l^xS^) 2 ). 

The scalar a can be estimated from the sample at hand, or set equal to one 
if Sg^ measures the actual variability of the copy number estimate around 
the true copy number. 

2.3. Copy number and genotype. For the joint analysis of copy number 
and genotype, we extend the transition probabilities in equations (2.1) and 
(2.2) to the hidden states normal (s), amplification (-), LOH (&), and dele- 
tion ()). For the emission probabilities, we assume conditional independence 
between the copy number estimates and the genotype calls: 

(2.10) /(CN,GT|5) = /(CN|5) x /(GT|«S). 

This equation can be further simplified, as the copy number distribution 
only depends on the true copy number, and the genotype distribution only 
depends on the true underlying state being $ or !. For example, for the 
deletion state we have 

(2.11) /{CN, GT | )} = /{CN | )} x /{GT | )} = /{CN |\} x /{GT | !}. 

The terms in equation (2.11) can be estimated as described above for geno- 
type and copy number. Emission probabilities for the other states can be 
obtained similarly. 

2.4. Simulation. The simulated data are available in the Bioconductor 
package VanillalCE. The simulation comprises one subject's genotype, copy 
number, and confidences scores for 9165 SNPs on chromosome 1. A descrip- 
tion of the 5 features simulated in chromosome 1, referred to by regions A-E, 
and the underlying hidden states in these regions follows. 
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Genotype calls. With the exception of Regions A, B and C in Figure 2, 
we simulated 9165 genotypes (the approximate number of SNPs in the two 

50k SNP chips) from a Bernoulli distribution with probability 0.7 of HOM. 
Unless otherwise indicated, confidence scores for GT were obtained by ran- 
dom draws of confidence scores in the Hapmap data when the CRLMM 
GT agreed with the gold-standard as defined by consensus of the HapMap 
genotyping centers. The reference distributions were made separately for 
the Affymetrix 50k Xba and Hind chips, and hence, the confidence score 
sampled for each SNP were made respective to the chip. 

Copy number. The Affymetrix CNAT tool (version 3.0) was used to ob- 
tain CN for the 9165 SNPs from a presumably normal individual in the 
HapMap dataset (sample NA06993). Deletions and amplifications were sim- 
ulated from Gaussian distributions with location parameters log 2 (l) and 
log 2 (3), respectively. For the scale parameter, we used a robust estimate 
of the log 2 transformed copy number standard deviation, denoted by e. To 
illustrate how a confidence score such as a standard error of the copy num- 
ber estimate could be useful, we simulated standard errors from a shifted 
Gamma: T(l,2) +0.3, where 1 is the shape parameter and 2 is the rate 
parameter. To ascertain the effect of qualitatively high confidence scores on 
the ICE HMM, we scaled e by ^. Similarly, to simulate less precise CN, we 
scaled e by 2. 

Regions A-E were simulated as follows: 

• Region A contains 200 SNPs spanning a physical distance of approxi- 
mately 5 Mb. Two chromosomal segments of 99 homozygous genotypes 
are separated by a chromosomal segment of 14 kb containing two het- 
erozygous SNPs. Using a 2-state hidden Markov model and using only 
the simulated genotypes as the observed data, the true underlying states 
(number of SNPs) are ! (99), $ (2), and ! (99) for the 3 segments, re- 
spectively. We augment the genotype calls with copy number estimates 
obtained directly from the CNAT analysis of a normal Hapmap subject's 
chromosome 1. Using the 3-state HMM for copy number, the true under- 
lying state is — > (200). Modeled jointly, the true underlying state is &; 
(99), s (2) and & (99). 

• Region B contains 100 SNPs spanning a physical distance of approxi- 
mately 2 Mb. Two chromosomal segments each containing 49 SNPs are 
both in regions of a hemizygous deletion. We assigned a homozygous geno- 
type call to all 98 SNPs in the two hemizygous deletions. The two hemizy- 
gous deletions are separated by a chromosomal segment of 360 basepairs 
with copy number two. To simulate an incorrect genotype call (the true 
genotype is homozygous for the 2 SNPs on the diploid segment), confi- 
dence scores for the two heterozygous SNPs are drawn from the distri- 
bution of confidence scores when the CRLMM genotype call of HET was 
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Fig. 2. A simulated chromosome with 9165 SNPs. Top: The simulated GT with uniform 
noise added to reduce overplotting (vertical axis) plotted against physical position (horizon- 
tal axis). Bottom: A magnification of region A. Two SNPs in region A with high simulated 
confidence scores are indicated by the square plotting symbol. Regions A-E are described in 
more detail in Section 2.4- In truth, there are 4 different segments in state loss (\, indicated 
in light grey above). The predicted hidden states from the vanilla (Van) and ICE HMMs 
are denoted by color in the two bars beneath the data points. The ICE HMM detects each 
of the 4 ! segments, whereas the vanilla HMM smoothes over a segment in A containing 
two heterozygous SNPs at position 52.8 Mb. Utilizing confidence scores for the genotype 
predictions, the ICE HMM may provide more precise locations for ! breakpoints. 



incorrect. Copy number estimates and corresponding confidence scores 
(standard errors) for the hemizygous deletion were simulated as described 
above, with the exception that high confidence scores were assigned to 
the two SNPs in the chromosomal segment with normal copy number. 
The true underlying state for the genotypes in Region B is ! (100). The 
true state for the copy number in region B is \ (49), — ► (2), and \ (49). 
Modeled jointly, the true states are ) (49), s (2) and ) (49). 
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• Region C is a segment containing 100 homozygous SNPs spanning < 2 
Mb in a hemizygous deletion. The true underlying states are ! (100) in 
the genotype HMM, s (100) in the copy number HMM, and ) (100) in the 
joint HMM. 

• Region D contains two segments with copy number 3 (< 1 Mb), separated 
by a diploid segment containing 2 SNPs (9.8 kb). The two amplified frag- 
ments are < 1 Mb. The true underlying states are $ (200) in the genotype 
HMM; / (99), -» (2), and / (99) in the copy number HMM; and - (99), 
s (2) and - (99) in the joint HMM. 

• Region E contains a microdeletion spanning 5 SNPs (94 kb) and a mi- 
croamplification containing 3 SNPs (294 kb). We assigned high confidence 
scores to the copy number estimates in both regions. The true underlying 
states are ! (5) and $ (3) in the genotype HMM, \ (5) and /* (3) in the 
copy number HMM and ) (5) and - (3) in the joint HMM. 

3. Results. This section describes results obtained from fitting HMMs 
to simulated and experimental data. The HMMs are written in the statis- 
tical language R (http://www.r-project.org) using S4 classes and meth- 
ods [Chambers (1998)]. In particular, the HMM is dependent on whether 
genotype estimates (abbreviated GT), copy number estimates (abbreviated 
CN), or both GT and CN are available as defined by three classes of ob- 
jects for SNP array data [Scharpf et al. (2007)]. Organizing the statistical 
methods in this way allows more flexibility to users interested only in char- 
acterizing chromosomal abnormalities in genotype (loss of heterozygosity, 
LOH) or copy number (deletion or amplification) respectively. When both 
GT and CN are available, the HMM will distinguish between copy- neutral 
LOH and deletion-induced LOH. We use the term LOH in this context as 
an unusually long stretch of homozygous SNPs, though these regions can 
be completely naturally occurring, for example, due to evolutionary pres- 
sure on chromosomal segments. For the simulation, we simulate GT and 
CN as described in Section 2.4, analyzing the GT and CN separately and 
then jointly. For the experimental data, we use a HapMap sample with 
a previously identified region of uniparental isodisomy, a mechanism for 
copy neutral LOH. Both the simulation and experimental data are based 
on 100k Affymetrix SNP chips (comprised of the Xba and Hind 50k chips). 
All figures shown are also available in color as supplementary material at 
http: / /biostat .j hsph.edu / ~ ir uczins / publications / sm / . 

3.1. Simulated data. SNP-level summaries were obtained using a combi- 
nation of real (experimental) and simulated data for 1965 SNPs measured on 
chromosome 1 of the 50k Hind and Xba Affymetrix SNP chips, as described 
in Section 2.4 for additional details. Because the states of the HMM are 
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determined by whether genotype estimates (GT), copy number estimates 
(CN), or both GT and CN are available, we organize the results accord- 
ingly. For each example, we plot both the predictions of a HMM that uses 
only the observed SNP-level summaries as input (vanilla), and a HMM that 
integrates confidence estimates (ICE) for the SNP-level summaries. 

Genotype HMM. The hidden states for the genotype HMM are retention 
($) and loss (!) of heterozygosity. In the upper panel of Figure 2 the simu- 
lated GT are plotted with uniform noise added to reduce overplotting. The 
predicted states from the vanilla and ICE HMMs are also shown. The pre- 
dictions from the vanilla HMM are the same as the predictions of the ICE 
HMM shown, with the exception of the region (A) magnified in the lower 
panel of Figure 2, where the ICE HMM correctly identifies the $ segment. 
Both approaches miss the 5 SNP spanning microdeletion in region E, but 
otherwise correctly predict the true underlying states (see Section 2.4 for de- 
tails). In general, for both the vanilla and ICE HMMs, the Viterbi algorithm 
(conditional on other parameters of the HMM model) chooses an optimal 
sequence of states that maximizes the likelihood of the observed genotype 
calls. The predicted states reflect a trade-off between the likelihood of the 
observed genotypes given the underlying states, and the transition probabil- 
ities. Unlike the vanilla HMM, emission probabilities in the ICE HMM are 
a function of the confidence scores (as described in Section 2), and factor 
into the likelihood. Intuitively, a high confidence score at a particular SNP 
has the effect of giving more weight to the emission probability and less 
weight to the state of the neighboring SNPs when determining the optimal 
sequence of states in the Viterbi algorithm. Hence, the sequence of states 
that maximizes the likelihood of the observed genotype calls differ in the 
ICE and vanilla HMMs when the confidence scores shifts the balance be- 
tween the opposing forces of the emission and transition probabilities. In 
particular, the high confidence scores at the two heterozygous SNPs in re- 
gion A favor the emission probability for $, causing two breakpoints in this 
region of ! and, hence, a more local smoothing of the HMM. Although the 
emission probability for state $ is greater than for state ! at these two SNPs 
in the vanilla HMM, the probability of having two breakpoints in a region 
of ! for SNPs that are physically close is small as reflected in the transition 
probability. Therefore, the vanilla HMM provides a smoothing that is less 
localized, corresponding to a sequence of ! predictions in region A without 
transitions to the normal state. 

Copy number HMM. The hidden states for autosomal copy numbers are 
hemizygous deletion (\), normal (two) copies (— >), and more than two 
copies (/*). Figure 3 (upper panel) shows the CN of the simulated dataset. 
In our simulation, chromosome 1 contains three amplifications /* (two seg- 
ments in D separated by a segment with normal copy number, and one in 
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Fig. 3. Top: Copy number estimates (vertical axis) versus physical position (horizontal 
axis) for 9165 SNPs on a simulated chromosome. Bottom: A magnification of regions D, 
B, and E. High confidence scores for the copy number estimates were simulated for the 
square points in regions D, B, and E. The two bars beneath the data points in each figure 
show the predicted hidden states from the vanilla (Van) and ICE HMMs. Note that where 
the predictions differ in regions D, B, and E, the ICE correctly classified the hidden states. 
Note that the vanilla HMM also indicates a (spurious) deletion to the left of region A, not 
indicated by the ICE HMM due to high variability in those copy number estimates. 



E), and four deletions \ (two segments in B separated by a segment with 
normal copy number, and one segment each in regions C and E). Also shown 
are the predicted states from the vanilla and ICE HMMs, respectively. The 
predictions from the two HMMs differ in regions B, D, and E magnified 
in the lower panel. Without confidence estimates for the copy number, the 
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transition probabilities dominate the likelihood as specified by the emission 
probabilities, and the vanilla HMM smoothes over the two SNPs with copy 
number 2 in regions B and D, and the amplification in region E. The high 
confidence scores used in this simulation for the copy number estimates in 
these regions make the transition between states more favorable, and thus, 
the ICE HMM makes the transition back to the normal state for regions 
B and D, and detects the amplification in region E. Note that when the 
confidences scores for the CN are low, as for the 2 SNPs with copy number 
near two in the hemizygous deletion in region C, the predictions with ICE 
and vanilla are identical. Also, the vanilla HMM detects a spurious deletion 
to the left of region A. As the confidence scores for those copy number esti- 
mates were low, the likelihood specified in the ICE HMM does not favor a 
transition to a nonnormal state. 

Genotype and copy number HMM. We plot both the GT and CN in the up- 
per panel of Figure 4. By modeling GT and CN simultaneously, we expand 
the state space of the HMM to include deletion-induced LOH ()), copy neu- 
tral LOH (&), normal (s), and amplification (-). The predicted states from 
the vanilla and ICE HMMs are also shown, and differences in predictions are 
indicated in the lower panel. As before, ICE correctly classifies all SNPs into 
the respective states, while the vanilla HMM, in the absence of uncertainty 
estimates, smoothes over some loci (regions A, B, D), and fails to detect 
the amplification (with high confidence scores) in region E. In contrast, the 
vanilla HMM does detect the microdeletion in region E. The ability of the 
vanilla HMM to detect the microdeletion in this example even in the ab- 
sence of confidence scores is attributable to the additional information that 
the genotype provides: SNPs in deleted regions all appear as homozygous, 
in contrast to amplifications, where homozygous and heterozygous SNPs 
occur. Additionally, the extra genotype information may reduce the occur- 
rence of predicted deletions that are spurious. For instance, in the absence 
of information on genotype calls in Figure 3, the vanilla HMM predicts a 
small deletion to the left of region A. As heterozygous genotype estimates 
in this region are incompatible with a deletion, the vanilla HMM no longer 
predicts this region to be a deletion in Figure 4. 

3.2. Experimental data. To illustrate the HMM approaches on experi- 
mental data, we used a HapMap sample with a previously identified (but 
not experimentally confirmed) UPD in chromosome 2. The Aflymetrix tool 
CNAT (version 3.0) and the R software CRLMM were used to obtain SNP- 
level summaries of copy number and genotype respectively. We caution that 
at this point in time the GT obtained using CRLMM (or the Affymetrix 
tools) implicitly assume that the copy number is two — ideally, allele spe- 
cific estimates should be used, and methods are under development (Rafael 
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Fig. 4. Top: The CN in Figure 3 are superimposed on the GT in Figure 2. We fit 
HMMs to the joint observation sequence of CN and GT without (vanilla) and with (ICE) 
confidence scores of the SNP-level summaries. The predictions from these two HMMs are 
represented by different shades of grey in the two bars beneath the data points in each panel. 
We used square plotting symbols to indicate SNPs for which we assigned high confidence 
scores to the genotype and copy number estimates. 



Irizarry, personal communication). Also, software to obtain confidence scores 
for CN based on probe-level variability and signal-to-noise ratio on the chip 
[such as described in Wang et al. (2007)] is not yet available. However, differ- 
ences in the SNP-specific standard deviations of the CN across a reference set 
of 90 HapMap samples have previously been reported [see, e.g., Zhao et al. 
(2004)], and can be used in a straightforward manner as measures of uncer- 
tainty [specifically, using those deviations as the in equation (2.9), and 
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Fig. 5. Top: A confirmed UPD between 190 and 200 Mb is detected by both HMMs in a 
HapMap sample from the CEPH dataset. Note that the vanilla HMM incorrectly predicts 
a small deletion of 3 SNPs in the middle of this region, whereas the ICE HMM provides 
a more global smoothing of the copy number estimates. Bottom left: a magnified view of 
three possible LOH regions (not confirmed). Only the middle region (143 Mh) is identi- 
fied by both HMMs as LOH. Because the CRLMM genotype calls agree with the HapMap 
consensus, the chromosomal segment containing the two heterozygous SNPs at 140 Mb 
is not a region of LOH, as predicted by the vanilla HMM. Bottom right: magnification of 
the vanilla (top) and ICE (bottom) predictions for the feature at 150 Mb. Again, the true 
genotype calls are heterozygous, and so the ICE HMM correctly identifies the chromosomal 
segment containing the two heterozygous SNPs as normal. 



estimating the scalar a from the autosomal SNP copy number estimates in 
the sample]. 

The upper panel in Figure 5 shows CN on the vertical axis against physical 
position on chromosome 2. The region of predominantly called homozygous 
SNPs at 190-200 Mb is a previously identified UPD [Ting et al. (2006)]. 
Also shown are the predictions from the vanilla and ICE HMMs. The con- 
firmed UPD between 190 and 200 Mb is detected by both HMMs, though 
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the vanilla HMM incorrectly predicts a small deletion of 3 SNPs in the 
middle of this region, whereas the ICE HMM provides a more global (and 
correct) smoothing of the copy number estimates. Also, the vanilla HMM 
finds a spurious amplification at about 210 Mb. The lower panel on the left 
provides a magnified view of the region between 135 and 155 Mb, where the 
vanilla and ICE HMMs differ. Only the middle region (at about 143 Mb) 
is identified by both HMMs as LOH (we again stress that we use the term 
LOH here as copy neutral stretches of homozygous SNPs, naturally occur- 
ring possibly due to evolutionary pressure on this chromosomal segment). 
The chromosomal segment at about 140 Mb contains the two heterozygous 
SNPs (confirmed in the HapMap data, and called as such by CRLMM), and 
thus is not a region of LOH, as predicted by the vanilla HMM. The lower 
panel on the right further zooms in on the vanilla and ICE predictions in the 
region around 150 Mb. The two SNPs with heterozygous genotype calls at 
about 151 Mb are truly heterozygous SNPs, and therefore, the ICE HMM 
correctly identifies the chromosomal segment containing these two heterozy- 
gous SNPs as normal. Due to the abundance of markers in the segment 
around 151.25 Mb exclusively called homozygous, the ICE HMM still indi- 
cates an LOH segment. Several studies have recognized the abundance of 
short, copy-neutral, entirely homozygous regions [see, e.g., Beroukhim et al. 
(2006)]. To illustrate the prevalence of short, homozygous sequences, we fit 
the vanilla and ICE HMMs to the chromosome 2 data of the 30 CEPH trio 
parents available from HapMap (60 independent samples), and highlight 
these copy-neutral, all homozygous regions in Figure 6. Clearly visible is the 
abundance of these regions, and the enriched locations along chromosome 2 
(possibly explained by evolutionary pressure). 

3.3. A vanilla/ICE comparison. We performed additional simulations 
to contrast the performances of the vanilla and ICE HMMs. Since large 
deletions and amplifications can easily be picked up by both approaches, we 
focused on small deletions and amplifications, spanning between 2 and 10 
consecutive SNPs. Since the results were as expected, we only describe the 
effects of the copy number variability and confidence scores on the detection 
of small deletions in detail. 

The experimental data consisted of genotype calls and copy number as 
described in Section 2.4. Copy number confidence scores were obtained by 
weighting the robust estimate of the within-chip log 2 copy number standard 
deviation by the standardized SNP specific standard deviation derived from 
a reference set of 90 HapMap samples (e.g., this weight for one particular 
SNP was the ratio of the across sample standard deviation for the SNP and 
the median of all those numbers across all SNPs). Simulated in these data 
were 450 sets of copy number estimates and confidence scores for deletions 
ranging from two to ten consecutive SNPs (50 data sets for each deletion 
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Fig. 6. An image of the predictions from the vanilla HMM fit to chromosome 2 of the 
60 parental samples in the CEPH trios dataset (top). The x-and y-coordinates used for 
the image are physical position and subject, respectively. Subject NA07056 has a confirmed 
UPD at 195 Mb. Also plotted are the frequencies of LOH across the 60 samples (middle) 
and the cytoband (bottom). 

size). The locations of the deletions were randomly selected on chromosome 
1 for each data set. The copy numbers in the deletions were simulated from a 
log-normal distribution with mean zero (indicating a true DNA copy number 
equal to one), and a standard deviation equal to a scaled version of the 
SNP specific variability described above. The scalar K controlled whether 
more (K < 1) or less (K > 1) precise copy number data than average were 
encountered in the deletion. For both vanilla and ICE, we calculated for 
each simulated data set the difference in log likelihoods between making a 
transition to the state for deletion ()) from the normal state (s) for the range 
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Fig. 7. Differences between the log likelihoods for the correct and incorrect state sequences 
for the vanilla (light grey) and ICE (dark grey) HMMs are indicated in the upper panels. 
The differences are shown for deletions of different sizes (horizontal axis), and four differ- 
ent scale parameters K for the copy number estimate variability in the simulated deletions 
(0.4, 0. 7, 1.0, 1.3, left to right). The data were scaled to fit the panels, and slightly smoothed 
from the raw data by exploiting an obvious mean and variance relationship. The middle 
row of panels shows the estimated probabilities of the differences in log likelihoods being 
positive (e.g., the proportion of instances when the correct model was favored over the in- 
correct one), assuming normality of the differences in the log likelihoods. The lower row 
of panels shows the estimated differences in these probabilities between ICE and vanilla. 



of the simulated deletion (and back after the deletion), versus staying in the 
normal state throughout. In other words, we calculated the difference of the 
log likelihood of the true state sequence minus the log likelihood of assigning 
the normal state s to all SNPs. 

The upper row of panels in Figure 7 indicate the distributions of the 
differences in the log likelihoods for both the vanilla (light grey) and ICE 
(dark grey) HMMs, shown for the deletions of different sizes, and using four 
different scale parameters K. For the first two panels, the variability in the 
simulated copy number estimates in the deleted region was less than in the 
original data (the standard deviations were reduced to 40% and 70% of the 
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original, respectively), and for the fourth panel the standard deviation in 
the simulated copy number estimates in the deleted region was increased by 
30%. The middle row of panels shows the respective estimated probabilities 
of the differences in log likelihoods being positive, for example, the propor- 
tion of instances when the correct model was favored over the incorrect one. 
The lower row of panels shows the difference in these probabilities between 
ICE and vanilla. Quite obvious is the fact that the ability to detect micro- 
deletions of a few SNPs depends on precise data, and the knowledge of that 
precision. For example, when the standard deviation of the simulated copy 
number estimates in the deletion was reduced to 40%, ICE was able to con- 
sistently detect even the smallest deletions, while vanilla was only able to do 
so for deletions of size 5 or larger (left panels). Naturally, larger deletions are 
easier to detect for both methods. As the quality of the data decreases (sim- 
ulated here as an increase in the variability of the copy number estimates in 
the deletion), the ability of ICE to detect the deletion suffers substantially, 
while vanilla is almost agnostic to these changes. When the standard devi- 
ation of the simulated copy number estimates in the deletion was increased 
by 30%, vanilla picked up the deletion more often than ICE (right panels). 
The reason for this is as follows: since the variability in the copy number es- 
timates is increased, the evidence of a deletion being present decreases, and 
ICE acknowledges this fact by incorporating the confidence estimates. Thus, 
the decrease in the proportion of instances where ICE favors a deletion over 
the normal state is a feature of the algorithm. The price to pay, otherwise, 
is in the number of false positives (i.e., the number of incorrectly inferred 
deletions at other loci). Simulating 200 "synthetic" normal chromosome lq 
arms with K = 1.3 across all SNPs, vanilla indicated spurious small dele- 
tions in 50 of these artificial chromosomal arms (for a total of 86 incorrect 
state predictions), while ICE indicated none. 

4. Discussion. Chromosomal DNA varies between individuals at the level 
of entire chromosomes, chromosomal segments, and changes in small ge- 
nomic regions down to one nucleotide (including single nucleotide polymor- 
phisms, SNPs). Many of these variations appear to be completely benign, 
but some are known or suspected to be associated with disease. Associa- 
tion studies often use some SNPs (in candidate gene studies) or hundreds of 
thousands of SNPs (in genome wide association studies) as potential candi- 
dates or markers of genes to investigate the relationship between genotype 
and phenotype. However, the abundance of copy number variations in the 
human genome and their role in disease have played an increasingly promi- 
nent role. In particular, the "common disease, common variant" paradigm 
has been challenged for some diseases [McClellan, Susser and King (2007); 
see, e.g., Sebat et al. (2007) for a case study on autistic and apparently 
normal subjects]. Undoubtedly, this change is due in part to the recent 
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technological advancement, in particular, on high density single nucleotide 
polymorphism (SNP) microarrays which allow for the detection of these al- 
terations. Besides copy number variations such as deletions and duplications, 
copy- neutral stretches of homozygosity can also be of scientific interest, as 
uniparental disomy as one such example has been implicated in disease. 

Copy number variations and loss of heterozygosity can arise through so- 
matic and germline events. In this manuscript, we developed methods most 
applicable for abnormalities as a consequence of germline events. Undoubt- 
edly, the stochastic process as defined by our transition probability could 
be too rigid for the analysis of data arising from a cancer sample, where 
microdeletions as well as a loss of an entire chromosomal arm might be 
present. Further, noninteger copy numbers do make sense in such samples 
due to the mix of normal and abnormal cells in the sample [i.e., mosaicism; 
see Ting et al. (2006) for an example], while we assume the copy numbers 
to be integers in our approach. While rare, noninteger copy numbers may 
occur even in "normal" genomes (this can occur throughout the body or in 
specific regions), and thus, may pose a problem for our algorithm. In general, 
even if our method could be extended to allow for noninteger copy numbers 
(at least the HMM for copy numbers, since the definition of "genotype" is 
unclear in such a setting), the ability to pick up noninteger copy numbers 
obviously depended on the quality of the data, the length of the non-normal 
region, and the actual value of said copy number. For example, delineating 
a small mosaic region in a sample with 95% normal cells and 5% of cells 
with a hemizygous deletion would likely not be possible. 

Our paper builds on a modular approach for analyzing SNP chip data, 
extending the functionality of statistical algorithms that pre-process probe- 
level data to produce SNP-level summaries of genotype and copy number. 
Noticeably, these approaches have mostly been developed for the Affymetrix 
platform (such as CRLMM for improved genotype estimates), but our ideas 
are portable to other high throughput platforms such as Illumina. In partic- 
ular, the vanilla HMM only relies on genotype (CN) and copy number (GT) 
estimates without any confidence scores, which can be exported directly from 
the Beadstudio software (http://www.illumina.com/). With one notice- 
able (and very recent) exception suggested by Colella et al. (2007), previous 
approaches using HMMs have considered genotype and copy number sepa- 
rately, not simultaneously in a single unifying statistical model that allows 
for the detection of copy number changes as well as copy neutral stretches 
of homozygosity in the genome. In this sense, this manuscript is not the first 
to propose such a unifying approach, albeit ours differs in several aspects 
from the Bayesian HMM of Colella et al. (2007). In particular, the incorpo- 
ration of uncertainty estimates can be critical, for example, in the detection 
of microdeletions. The investigation of one particular sample as discussed in 
this manuscript, however, does not allow for conclusive statements how the 
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detected alterations are associated with the phenotype. In particular, it has 
been well established that copy number variations and copy neutral stretches 
of homozygous genotypes are prevalent in many phenotypically normal indi- 
viduals. Identifying features that may be associated with a particular pheno- 
type are better handled by statistical models for between-sample variation 
in studies with phenotypically normal and diseased populations. Such mod- 
els reside in the next tier of our modular approach to the analysis of SNP 
chip data and are an extension of the ideas presented here. 

In summary, we developed a HMM for SNP chips using the joint ob- 
servation sequence of copy number (CN) and genotype (GT) estimates as 
input. We demonstrated that a HMM model that uses both CN and GT can, 
for example, distinguish copy-neutral LOH from deletion-induced LOH. We 
also demonstrated how pre-processing algorithms that provide confidence 
scores of SNP-level summaries can be integrated into the emission proba- 
bilities of the HMM to control smoothing in a probabilistic framework, and 
showed that this can lead to much improved results. Specifically, confidence 
estimates allow smoothing to be more local or global depending on the un- 
certainty of the pointwise estimates. We demonstrated how high confidence 
scores helped in identifying a very small amplification otherwise missed (Fig- 
ure 4, region E), while low confidence scores for CN and GT had the desir- 
able effect of providing a more global smoothing (Figure 5). In particular, in 
the experimental data example, this helped to reduce the number of regions 
identified as LOH in the vanilla HMM, and eliminated the (presumably, spu- 
rious) indication of a small deletion and a small amplification. We believe 
that the ability to detect microdeletions and microamplifications could be of 
utmost importance to explain the genetic basis of many diseases. Undoubt- 
edly, this ability will greatly depend not only on the number of markers 
investigated (such as the number of SNPs used on a particular platform) 
and the quality of the data produced (i.e., the precision of the genotype and 
copy number estimates), but also on how the uncertainty of the estimates 
is utilized. In this sense, we hope that our method and software provides a 
useful tool for the scientific community. 
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